Automatic segmentation of esophageal cancer, metastatic lymph nodes and their adjacent structures in CTA images based on the UperNet Swin network

Abstract
Objective: To create a deep-learning automatic segmentation model for esophageal cancer (EC), metastatic lymph nodes (MLNs) and their adjacent structures using the UperNet Swin network and computed tomography angiography (CTA) images, and to improve the efficiency and precision of automatic EC segmentation and TN stage diagnosis.
Methods: Attention U-Net, UperNet Swin, UNet++ and UNet were used to train EC segmentation models that automatically segment the EC, esophagus, pericardium, aorta and MLN from CTA images of 182 patients with postoperative pathologically proven EC. The Dice similarity coefficient (DSC), sensitivity, and positive predictive value (PPV) were used to assess segmentation performance. The volume of EC was calculated from the segmentation results, and the outcomes and times of automatic and manual segmentation were compared. All statistical analyses were completed using SPSS 25.0 software.
Results: Among the four EC autosegmentation models, the UperNet Swin had the best results, with a DSC of 0.7820 and the highest EC sensitivity and PPV. With the UperNet Swin, the esophagus, pericardium, aorta and MLN had DSCs of 0.7298, 0.9664, 0.9496 and 0.5091, respectively, and the DSCs for T1-T4 EC were 0.6164, 0.7842, 0.8190, and 0.7259, respectively. The volumes of EC and its adjacent structures did not differ significantly between the ground truth and the UperNet Swin model.
Conclusions: The UperNet Swin showed excellent efficiency in autosegmentation and volume measurement of EC, MLN and adjacent structures at different T stages, which can aid T and N stage diagnosis of EC and save clinicians time and effort.

Esophageal cancer (EC) is the seventh most common cancer and the sixth leading cause of cancer death worldwide [1,2], with an estimated 572,000 new cases and 509,000 deaths in 2018 [2], seriously affecting human health and quality of life [3]. CT scans are crucial for the examination of EC, and accurate segmentation, 3D reconstruction, and 3D morphological quantification of EC on CT images are essential for accurate T and N stage diagnosis, the choice of treatment strategy, and prognostic assessment [4,5]. Esophagoscopy is the gold standard for EC diagnosis [6] and can confirm the diagnosis. However, it is invasive and expensive, it cannot be used for T and N stage diagnosis, and it cannot show the 3D shape of the cancer or its spatial relationship with adjacent structures. EC can also metastasize in multiple directions via lymph nodes in the para-esophageal region, neck, abdominal cavity, and mediastinum [7], so lymph node metastasis also needs to be identified when diagnosing EC.
Multirow CT or CTA images have recently been found to increase the precision of EC TNM stage diagnosis [8,9]. Because manual segmentation of EC is laborious, time-consuming and prone to errors, we are committed to developing automatic segmentation models for EC to lessen doctors' workload and increase efficiency.
Currently, tumor volume measurement is mostly used to predict T stage in colorectal cancer, nasopharyngeal carcinoma and non-small cell lung cancer, where it helps in accurate diagnosis and prognostic assessment, but it is less often used to predict T stage in EC [10-12]. In our previous study, CTA-based 3D reconstruction of EC allowed us to clearly observe the location, 3D shape and spatial relationships of EC. We also found that the volume and the major and minor axes of EC are significant predictors of the preoperative T stage of the tumor [13,14]. Traditionally, N stage diagnosis is determined by the number of metastatic lymph nodes.
Omeroglu S et al. showed that, in patients with stage III colorectal cancer (CRC), an MLN larger than 1.05 cm may predict a poorer prognosis and lower survival. Maximum MLN size may be used as a surrogate for MLN number when predicting prognosis or staging patients with CRC [15,16]. We therefore think that the volume of metastatic lymph nodes is correlated with their number, and that segmentation and 3D volume calculation of the tumor and metastatic lymph nodes can help determine the T and N stage of EC.
KEYWORDS
3D reconstruction, automatic segmentation, computed tomography angiography (CTA), deep learning, esophageal cancer

In recent years, with the rapid development of deep learning, many automatic segmentation models based on convolutional neural networks have been proposed and widely used for automatic recognition and segmentation of medical images. The U-Net model proposed by O. Ronneberger et al. [17] at the MICCAI conference in 2015, the UperNet proposed by Xiao T et al. [18] in 2018, and the Swin Transformer proposed by Liu Z et al. [19] are widely used in computer vision tasks.
The Swin Transformer network can also be used for microscopic feature recognition in alloys: Liu P et al. [20] used it to accurately identify the microscopic features of 2.25Cr1Mo0.25V steel in order to understand the mechanism of hydrogen embrittlement (HE) and evaluate the anti-HE performance of alloys. Nevertheless, it is rarely used in medical image segmentation, and the segmentation of EC and its MLNs has rarely been reported.
Jin L et al. [21] used CT images to segment EC, and Zhang P. et al. [22] used barium esophagrams to automatically identify EC. They only identified and segmented EC and did not segment the MLNs or the structures adjacent to EC, even though accurate segmentation of EC, MLNs and adjacent structures is very important.
Therefore, to improve the accuracy and efficiency of EC segmentation and shorten the segmentation time, we aimed to use the UperNet Swin network to create an intelligent segmentation model for EC that supports TN stage diagnosis and treatment decisions, reduces clinicians' workload and improves work efficiency.

| MATERIALS AND METHODS
The flowchart of this experiment is shown in Figure 1.

| Data Information
The dataset used in this experiment comprises a cohort of 156 patients treated from December 2018 to September 2022 at the First Affiliated Hospital of Army Medical University (Southwest Hospital) and 26 EC cases from April 2020 to April 2021 at Shanxi Cancer Hospital. All images were obtained by chest CTA scanning on 256-row CT scanners. The study was conducted according to the Declaration of Helsinki and approved by the Medical Ethics Committee of the First Affiliated Hospital of Army Military Medical University (No. (B)KY2021165). The data that support the findings of this study are available from the corresponding author upon reasonable request.
We enrolled a total of 182 EC patients, staged according to the American Joint Committee on Cancer (AJCC) 8th edition cancer staging guidelines [23], after excluding nonthin-slice CTA images, images with poor quality or artifacts that significantly affected the experiments, and patients with EC combined with another malignant tumor. We confirmed the number of metastatic lymph nodes by postoperative pathology and segmented them. A total of 17,205 preoperative CTA images were collected from the 182 patients; the details of the case information are shown in Table 1.

| Image segmentation and preprocessing
We imported each thin-slice contrast-enhanced CT image, with a thickness of 1-2 mm, from the workstation into Amira in DICOM format, manually segmented the EC in the CTA image to obtain an accurate ground truth (GT), and calculated the GT volume. The EC, normal esophagus, MLNs, pericardium, aorta, bronchial area and lungs in the CTA images were segmented by two experienced imaging physicians, and we recorded the volume of each structure and the time taken to segment it manually.
For the delineation of each tumor region, the gross tumor volume (GTV) was drawn along the outline of the EC around the total tumor volume, and any pixels with attenuation less than −50 HU were excluded to avoid interference from adjacent air, fat, blood vessels and bones in the tumor ROI delineation. When there was uncertainty about the tumor area, the region was not included in the tumor delineation [24,25]. We preprocessed the input data by image cropping, intensity normalization, and resolution normalization. The size of the original CTA images was 512 × 512 × 420, and the extent of the cancer and its surrounding structures did not exceed 320 × 320 × 16 in any image. Therefore, we focused on the tumor area, cropped the CTA images to 320 × 320 × 16, and then used them for deep learning modeling [21].
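The cropping, attenuation rule and intensity normalization described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the pipeline used in the study: the function names `center_crop`, `exclude_low_attenuation` and `normalize_intensity`, the choice of crop centre, and the normalization window of −50 to 250 HU are hypothetical. Only the −50 HU threshold and the 320 × 320 in-plane crop size come from the text.

```python
from typing import List, Sequence


def center_crop(image: Sequence[Sequence[float]], size: int,
                center_row: int, center_col: int) -> List[List[float]]:
    """Crop a size x size window around a centre, clamped to the image bounds."""
    n_rows, n_cols = len(image), len(image[0])
    if size > n_rows or size > n_cols:
        raise ValueError("crop size exceeds image size")
    top = min(max(center_row - size // 2, 0), n_rows - size)
    left = min(max(center_col - size // 2, 0), n_cols - size)
    return [list(row[left:left + size]) for row in image[top:top + size]]


def exclude_low_attenuation(mask: Sequence[Sequence[int]],
                            hu: Sequence[Sequence[float]],
                            threshold: float = -50.0) -> List[List[int]]:
    """Remove mask pixels whose attenuation is below the threshold (in HU)."""
    return [[m if v >= threshold else 0 for m, v in zip(m_row, h_row)]
            for m_row, h_row in zip(mask, hu)]


def normalize_intensity(hu: Sequence[Sequence[float]],
                        low: float = -50.0, high: float = 250.0) -> List[List[float]]:
    """Clip to an assumed HU window [low, high] and rescale to [0, 1]."""
    span = high - low
    return [[(min(max(v, low), high) - low) / span for v in row] for row in hu]
```

For a 512 × 512 slice, `center_crop(slice_hu, 320, row, col)` would return the 320 × 320 window nearest to the chosen tumour centre; the same crop would need to be applied to the mask.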

| Autosegmentation Network
The preprocessed images were input into the UperNet Swin network for modeling. This network uses UperNet as the basic framework and the Swin Transformer as the backbone, which significantly enhances its feature extraction ability. Our study applies this network to the segmentation of EC and its adjacent structures for the first time; the specific structure is shown in Figure 2.
UperNet is based on the feature pyramid network (FPN) [18]. FPN is a general-purpose feature extractor with a simple design. The feature map is upsampled only by bilinear interpolation rather than time-consuming deconvolution, and the top-down path is fused with the bottom-up path through 1 × 1 convolutional layers followed by element-wise summation, without any complex refinement module. This simplicity is what makes it efficient [26]. The pyramid pooling module (PPM) makes full use of global information to prevent the model from overlearning large-scale targets such as the pericardium while ignoring small-scale targets such as tumors. Image features enter the FPN after passing through the PPM, and in the top-down path the features are combined with those generated in the bottom-up path. The features generated at each level of the top-down path are then fused, and the fused features pass through a convolution and a classifier to obtain the final prediction map [26,27].
The Swin Transformer is a hierarchical vision Transformer that uses shifted windows and serves as a general-purpose backbone for computer vision. Like the hierarchical feature maps in convolutional neural networks, its feature maps are downsampled by factors of 4, 8 and 16 relative to the input image, so such a backbone can support tasks such as object detection and instance segmentation [19]. During training, we used data augmentation strategies such as image rotation, scaling, brightness, contrast and gamma adjustment, and mirroring to improve the segmentation performance of the model.
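To make the hierarchical feature maps of the backbone concrete, the short sketch below computes the spatial size of each feature map for a given input size. The function name `feature_map_sizes` is hypothetical, and pairing the 320 × 320 cropped input with the downsampling factors 4, 8 and 16 named in the text is an assumption made for illustration; the actual configuration of the network is not given in the source.

```python
from typing import List, Sequence, Tuple


def feature_map_sizes(height: int, width: int,
                      factors: Sequence[int] = (4, 8, 16)) -> List[Tuple[int, int]]:
    """Return the (height, width) of each feature map for the given downsampling factors."""
    sizes = []
    for factor in factors:
        # Require exact divisibility so that every level maps cleanly onto the input grid.
        if height % factor or width % factor:
            raise ValueError(f"input size is not divisible by {factor}")
        sizes.append((height // factor, width // factor))
    return sizes
```

Under these assumptions a 320 × 320 slice yields feature maps of 80 × 80, 40 × 40 and 20 × 20, which are the resolutions that the FPN-style decoder would fuse.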
FIGURE 1 Flow chart of our research. The obtained CTA images are first manually segmented, preprocessed by cropping, windowing, rotation, etc., and then fed into the UperNet Swin network for model training to obtain a prediction map. Finally, the prediction images are used for 3D reconstruction to evaluate the model performance from both 2D and 3D aspects.

All experiments were implemented in Python with the PyTorch deep learning framework, which was used to build and train the model. During the model training phase, data augmentation was performed using random rotation, random horizontal flipping, and random brightness and contrast transformations. Cross-entropy loss was used during training with the Adam optimizer and an initial learning rate of 0.001, and the model was trained on a Quadro RTX 5000 GPU.
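Part of the augmentation step can be sketched in plain Python as follows. This is an illustrative sketch only, not the authors' implementation: the function name `augment`, the flip probability and the brightness and contrast ranges are assumptions, image intensities are assumed to be already normalized to [0, 1], and random rotation is omitted for brevity. The point it illustrates is that geometric transforms must be applied identically to the image and its mask, whereas intensity transforms apply to the image only.

```python
import random
from typing import List, Sequence, Tuple


def augment(image: Sequence[Sequence[float]], mask: Sequence[Sequence[int]],
            rng: random.Random, flip_prob: float = 0.5,
            max_brightness: float = 0.1, max_contrast: float = 0.1
            ) -> Tuple[List[List[float]], List[List[int]]]:
    """Randomly flip an image/mask pair and jitter the image intensities."""
    if rng.random() < flip_prob:
        # Horizontal flip: mirror image and mask together so labels stay aligned.
        new_image = [list(row[::-1]) for row in image]
        new_mask = [list(row[::-1]) for row in mask]
    else:
        new_image = [list(row) for row in image]
        new_mask = [list(row) for row in mask]
    # Intensity jitter affects the image only; the labels are left untouched.
    contrast = 1.0 + rng.uniform(-max_contrast, max_contrast)
    brightness = rng.uniform(-max_brightness, max_brightness)
    new_image = [[min(max(v * contrast + brightness, 0.0), 1.0) for v in row]
                 for row in new_image]
    return new_image, new_mask
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible between runs.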

| Statistical Analysis
We used one-way ANOVA to compare continuous variables and chi-square tests or Fisher's exact tests to compare categorical variables [28,29]. Continuous variables are presented as the mean ± standard deviation, and categorical variables as numbers and percentages.
A p-value < 0.05 indicated statistical significance. All statistical analyses in this study were performed using SPSS 25.0. We also calculated the predicted volume of EC from the EC segmentation results and compared it with the volume recorded in the ground truth.
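The volume of a segmented structure can be obtained by counting labelled voxels and multiplying by the volume of one voxel. The sketch below illustrates that calculation; the function name `mask_volume_cm3` and the spacing values used in the example are assumptions, since the source does not state how the volume was computed or what the voxel spacing was.

```python
from typing import Sequence, Tuple


def mask_volume_cm3(mask: Sequence[Sequence[Sequence[int]]],
                    spacing_mm: Tuple[float, float, float]) -> float:
    """Volume of a binary 3D mask (slices x rows x columns) in cubic centimetres."""
    n_voxels = sum(1 for slice_ in mask for row in slice_ for value in row if value)
    dz, dy, dx = spacing_mm            # voxel edge lengths in millimetres
    volume_mm3 = n_voxels * dz * dy * dx
    return volume_mm3 / 1000.0         # 1 cm^3 = 1000 mm^3
```

For example, a mask of 8 voxels with an assumed spacing of 1.0 × 0.5 × 0.5 mm covers 2 mm³, that is, 0.002 cm³. Applying the same function to the ground-truth and predicted masks gives the two volumes that are compared statistically.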

| Model evaluation indicators
We used the Dice similarity coefficient (DSC), sensitivity and positive predictive value (PPV) to evaluate the predictive performance of the model [30]. DSC is a statistic used to evaluate the similarity of two samples; it essentially measures the overlap between them and is used here to evaluate the spatial overlap accuracy between manual and automatic segmentation [31]. When the predicted mask and the ground-truth mask coincide exactly, the values of DSC, sensitivity and PPV are all 1, and the segmentation result is best. The evaluation indices are calculated as follows:

$$\mathrm{DSC} = \frac{2\,|\mathrm{Pred} \cap \mathrm{GT}|}{|\mathrm{Pred}| + |\mathrm{GT}|}$$

$$\mathrm{Sensitivity} = \frac{|\mathrm{Pred} \cap \mathrm{GT}|}{|\mathrm{GT}|}$$

$$\mathrm{PPV} = \frac{|\mathrm{Pred} \cap \mathrm{GT}|}{|\mathrm{Pred}|}$$

where GT denotes the manually annotated ground truth and Pred denotes the segmentation predicted by the model. |Pred| and |GT| represent the areas of Pred and GT, respectively [30], and |Pred ∩ GT| denotes the spatial overlap region between Pred and GT.
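The three evaluation indices can be computed directly from a pair of binary masks. The following is a minimal sketch using the standard definitions, DSC = 2|Pred ∩ GT| / (|Pred| + |GT|), sensitivity = |Pred ∩ GT| / |GT| and PPV = |Pred ∩ GT| / |Pred|. The function name `overlap_metrics` and the convention of returning 1.0 when a denominator is zero are assumptions made for illustration, not details taken from the study.

```python
from typing import Dict, Sequence


def overlap_metrics(pred: Sequence[Sequence[int]],
                    gt: Sequence[Sequence[int]]) -> Dict[str, float]:
    """Compute DSC, sensitivity and PPV for two binary 2D masks of equal shape."""
    overlap = pred_size = gt_size = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            pred_size += 1 if p else 0
            gt_size += 1 if g else 0
            overlap += 1 if (p and g) else 0

    def ratio(numerator: float, denominator: float) -> float:
        # Assumed convention: an empty denominator counts as perfect agreement.
        return numerator / denominator if denominator else 1.0

    return {
        "dsc": ratio(2 * overlap, pred_size + gt_size),
        "sensitivity": ratio(overlap, gt_size),
        "ppv": ratio(overlap, pred_size),
    }
```

In this formulation sensitivity penalises missed ground-truth pixels and PPV penalises false-positive pixels, while DSC balances the two.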

| Statistical analysis of the dataset
A total of 182 patients were included, and we randomly divided the data into a training set and a test set at a ratio of 8:2. There were no significant differences between the training and test sets in the characteristics of age, sex, T stage, N stage, tumor location, clinical stage, neurovascular invasion, and degree of pathological differentiation (p > 0.05). The patient characteristics are summarized in Table 1, and these characteristics were well balanced between the two cohorts.
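A random 8:2 split of this kind can be sketched as follows. The function name `split_patients`, the fixed seed, the rounding rule and the decision to split at the patient level (so that all images from one patient fall in the same set) are assumptions for illustration; the source states only the 8:2 ratio.

```python
import random
from typing import Iterable, List, Tuple


def split_patients(patient_ids: Iterable[int], train_fraction: float = 0.8,
                   seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle patient identifiers reproducibly and split them into train and test lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)          # seeded so the split can be reproduced
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]
```

Splitting by patient rather than by image avoids placing slices from the same patient in both sets, which would otherwise inflate test performance.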

| Comparison of the performance of different network models for segmenting EC
We trained four models independently under the same experimental protocol. Table 2 reports the results of the comparison. The UperNet Swin network achieved the best performance, with a DSC of 0.782 for EC and values of 0.8033 and 0.7618 for sensitivity and PPV, respectively, which were superior to those of the Attention U-Net, UNet++, and UNet models, whose DSC values were 0.6479, 0.6761, and 0.6666, respectively (Figures 3 and 4).
The highest DSC value for EC was 0.7820 with the UperNet Swin network, compared with 0.6479, 0.6761, and 0.6666 with the Attention U-Net, UNet++, and UNet, respectively. The esophagus had DSC values of 0.7446, 0.7298, 0.7381, and 0.7090 with the Attention U-Net, UperNet Swin, UNet++, and UNet, respectively. The DSC values of the pericardium were above 0.96 with all four models, with the lowest value of 0.9635 for the Attention U-Net and the highest value of 0.9664 for the UperNet Swin. The highest DSC value for the aorta was 0.9496 for the UperNet Swin, and the lowest was 0.9439 for the Attention U-Net. The DSC values of MLN were 0.5359, 0.5091, 0.3153, and 0.5141 for the Attention U-Net, UperNet Swin, UNet++, and UNet, respectively, with the Attention U-Net having the highest value (Table 2; Figures 5 and 6).
Regarding sensitivity, the values of the four models for EC segmentation were 0.6452, 0.8033, 0.6896, and 0.7641, with the highest value for the UperNet Swin. The model with the highest sensitivity for pericardium segmentation was the UperNet Swin, with a value of 0.9707. The model with the highest sensitivity for MLN and esophagus segmentation was the Attention U-Net, with values of 0.5965 and 0.7149, respectively. The UNet had the highest sensitivity for aorta segmentation, with a value of 0.9532.
Regarding PPV, the values for EC were 0.6506, 0.7618, 0.6631, and 0.5912, respectively, with the UperNet Swin performing best. The esophagus had the highest PPV with the UperNet Swin, at 0.8100; the aorta and MLN had the highest PPV with the UNet, at 0.9488 and 0.6901, respectively. For the pericardium, the Attention U-Net had the highest PPV, at 0.9640.
Among the four network models, the overall segmentation results of the UperNet Swin network were better than those of the other models, with DSC values of 0.6164, 0.7842, 0.8190 and 0.7259 for stages T1-T4, respectively. The UNet was better than the other three models in terms of sensitivity for T1-T3 EC, and the UperNet Swin had the best sensitivity for T4. The PPV of the UperNet Swin was higher than that of the other models for T1 and T3 but not for T2 and T4; the Attention U-Net showed the best PPV for T4 (Table 3).

| Comparison of manually segmented and automatically segmented volumes
We compared the volumes segmented automatically by the model with the manually segmented volumes for different T stages and different structures.
The manually segmented volumes for stages T1-T4 were 7.34 ± 1.78, 17.03 ± 9.70, 23.16 ± 13.88 and 35.41 ± 37.56 cm³, and the volumes automatically segmented by the UperNet Swin were 7.90 ± 4.74, 12.59 ± 10.81, 27.74 ± 32.96 and 37.58 ± 16.06 cm³, which were not significantly different (p > 0.05) (Table 4). The GT volume of EC was 18.56 ± 2.16 cm³, and the predicted volume of the UperNet Swin was 19.31.88 ± 25.49 cm³, with p = 0.305 (p > 0.05), indicating no significant difference between them. The GT volumes for the esophagus, pericardium, and aorta were 40.85 ± 8.36, 884.27 ± 398.75 and 166.65 ± 293.88 cm³, respectively, and the predicted values were 44.79 ± 11.93, 781.46 ± 157.19 and 159.24 ± 31.14 cm³, respectively; all p values were greater than 0.05, indicating that none of the differences were statistically significant. The GT and predicted MLN volumes were 4.53 ± 8.95 and 0.89 ± 0.52 cm³, respectively, with p = 0.002, and the difference between the two values was statistically significant (Table 5).

| Segmentation Time Comparison
The trained model takes approximately 18 s to finish EC automatic segmentation, while the manual segmentation time is 1135.

4 | DISCUSSION
4.1 | Advantages and significance of our deep learning model
In this study, we propose a deep learning model, UperNet Swin, for automated segmentation and volume analysis of EC at different T stages and of the structures adjacent to EC. Our model achieves good performance in segmenting and measuring the volume of EC, approaching the level of a radiologist. In addition, our model is computationally efficient, taking only 18 s to segment and measure the volume of a patient's EC and its adjacent structures, whereas it takes up to 1 h for a clinician or radiologist to accurately segment the volume for each patient; the model therefore greatly reduces physician workload and saves medical resources.
The UperNet Swin has the highest EC DSC value and performs best in the task of segmenting tumors. This is because the UperNet Swin network uses the Swin Transformer structure in each feature extraction module, and the Swin Transformer module has a strong feature extraction ability, which to a certain extent solves the problems of limited feature extraction and poor image quality of the UperNet and improves segmentation performance [18,19]. We also compared the DSC values of the four models across structures, including the EC, esophagus, pericardium, aorta, and lymph nodes, and found that the pericardium and aorta had the highest DSC values, the lymph nodes had the lowest, and the EC and esophagus were in between. The pericardium and aorta are relatively large, with clear outlines and obvious boundaries with the surrounding tissue, so they are easy to identify in CTA images and their DSC values are relatively high. By comparison, lymph nodes are very small, with an average volume of only 4.53 cm³, and their locations are not fixed and are not easy to identify, so the DSC value of metastatic lymph nodes is low.

| Comparison with previous studies
Previous studies have shown that deep learning can improve consistency and save time in delineating tumor volumes and surrounding organs at risk in cerebral hemorrhage [32], nasopharyngeal carcinoma [33], cervical cancer [34,35], breast cancer [36,37], rectal cancer [38,39], pulmonary blood vessels [40], and lung cancer [41,42]. Therefore, one of our main goals was to accurately segment and quantify tumor volume in EC patients. The UperNet Swin performs well in segmenting and measuring EC volume, which is consistent with previous deep learning studies on EC [21,22,43-45]. Our approach can not only quantitatively assess EC accurately but also help with EC T-stage diagnosis, prognostic assessment and treatment decision-making.
While several previous studies have used machine learning methods to segment tumors in EC patients, they did not segment the MLN of EC or the adjacent structures surrounding the tumor. To automatically identify the GTV outline of 215 EC patients, Jin L et al. [21] used three deep learning models: 2D U-Net, 3D V-Net, and VUMix-Net. Their study indicated that the DSC of the VUMix-Net mixed model was marginally higher than that of the single network models, with a DSC value of 0.68. In our study, the automatic segmentation of EC using the UperNet Swin network had a DSC value of 0.782, which was higher than that of their model. Beyond locating EC, we also calculated the EC volume from the automatic segmentation results and evaluated the T stage of EC, which addresses a gap in their research and allows EC to be located and diagnosed more comprehensively. Zhang P. et al. [22] used a deep learning system (DLS) to detect EC on barium esophagrams and dichotomous images with an accuracy of 0.837. Gong EJ et al. [44] used endoscopic images to establish a deep learning model for diagnosing EC, esophageal dysplasia and inflammatory esophagus, and the accuracy of the model for diagnosing EC was 0.78. The results of our model agree with those of Zhang P. et al. and Gong EJ et al., but they used only one deep learning model to diagnose EC and did not train additional models. Therefore, our advantage is that we selected the optimal model to segment EC, which improves the efficiency of identification and segmentation.
Overall, the UperNet Swin performed best in the segmentation task for EC and its adjacent structures. The UNet++ and UNet are better for segmentation of large structures such as the esophagus, pericardium and aorta, while the Attention U-Net gives better results for the segmentation of small structures such as MLN. Wu L. et al. [46] used radiomics and CT images for the prediction of MLN in EC. In contrast to their technique, we employed a deep learning network for MLN segmentation, with a DSC value of 0.5359, which was less successful. Because the MLN was tiny and not easy to identify and distinguish from connective tissue, autosegmentation was difficult. The fact that the DSC values for stages T2 and T3 are higher than those for T1 and T4 may be due to the larger sample sizes of the T2 and T3 groups. Tumors in T1 are relatively small and cannot be identified and segmented automatically with ease, and tumors in T4 have irregular 2D and 3D morphology, which prevents accurate autosegmentation. The UperNet Swin takes the longest time, probably because it extracts more tumor features and needs more time to integrate them to improve the segmentation results and obtain better DSC values.

| LIMITATIONS
First, although our data came from two different centers, the relatively small amount of data from Shanxi Cancer Hospital resulted in less balanced data, so we will continue to increase the amount of data in the future. Second, the DSC values of the model are still not very good, especially at the upper and lower boundaries of EC, and when EC and lymph nodes are close to each other, the network recognizes them poorly. Third, the doctors defined the upper and lower limits of EC by combining the findings from CTA, gastroscopy, barium esophagography, and even PET-CT, whereas the model could only learn to delineate them from CTA images, which explains its inaccuracy in identifying the upper and lower bounds of EC.

| CONCLUSION
In general, we used the UperNet Swin network to create an intelligent segmentation model for EC, which achieves good results in segmenting EC and its surrounding adjacent structures.

FIGURE 2 UperNet Swin network structure. First, the preprocessed original CTA image and the mask containing the ground truth are input, and feature extraction is carried out by the backbone module. The global information is fully utilized through the PPM before entering the FPN for feature fusion, and the fused image features pass through a convolution and a classifier to obtain the final prediction map.

FIGURE 3 Comparison of EC predictions by the UperNet Swin network model in different parts of the esophagus. A1-A3 are the original image, gold standard and model prediction map of upper EC. B1-B3 are the original image, gold standard and model prediction map of middle and upper EC. C1-C3 are the original image, gold standard and model prediction map of middle EC. D1-D3 are the original image, gold standard and model prediction map of middle and lower EC. E1-E3 are the original image, gold standard and model prediction map of lower EC.

FIGURE 4 2D segmentation and 3D reconstruction results of EC with different T staging on four models. A1-D1 are the segmentation results of the gold standard and all models superimposed on one image. A2-D2 compare the segmentation results of the gold standard and the Attention U-Net model. A3-D3 compare the segmentation results of the gold standard and the UperNet Swin model. A4-D4 compare the segmentation results of the gold standard and the UNet++ model. A5-D5 are the 3D reconstruction models of the gold standard; A6-D6 are the 3D reconstruction models of the segmentation results of the UperNet Swin model.

FIGURE 5 Volume comparison of different structures in the UperNet Swin model with the gold standard.

FIGURE 6 Recognition of the tumor boundary by the UperNet Swin model and manual segmentation. A1-A3 and B1-B3 are mid-thoracic esophageal planes, where the gold standard is normal esophagus, which is automatically segmented as tumor lesions by the model; C1-C3 and D1-D3 are tracheal carina planes, where the gold standard is tumor and enlarged lymph nodes, which are jointly recognized as tumor by the model. The location of lymph nodes was not clearly distinguished, resulting in excessive tumor volume.

TABLE 1 Clinical characteristics of patients in training and validation sets.
Note: Data are expressed as number (%) and mean ± SD, depending on variable distribution. a p < 0.05 is considered statistically significant.

TABLE 2 Quantitative segmentation results of different network models for each structure on the test set.
Note: The bolded values indicate the highest values of DSC, sensitivity and PPV for different structures of esophageal cancer and its adjacent structures in different networks.
Abbreviations: DSC, Dice similarity coefficient; EC, esophageal cancer; PPV, positive predictive value.

TABLE 3
Note: The bolded values indicate the highest values of DSC, sensitivity and PPV in different networks for different T stages of esophageal cancer.
Abbreviations: DSC, Dice similarity coefficient; EC, esophageal cancer; PPV, positive predictive value.

TABLE 4 Volume comparison between automatic and manual segmentation of the UperNet Swin model in different T stages.
Note: Data are mean ± SD. a p < 0.05 is considered statistically significant.

TABLE 5 Volume comparison of the UperNet Swin model with manual segmentation for different structures.
Note: Data are mean ± SD. Bolded values indicate differences between the volumes of the UperNet Swin network and the hand-segmented MLN. This shows that the network is not very accurate in segmenting small structures and that the difference from manual segmentation is large, which is the main direction for our subsequent optimization of the model.
Abbreviations: EC, esophageal cancer; MLN, metastatic lymph nodes. a p < 0.05 is considered statistically significant.

| REFERENCES
4. [...] diagnostic accuracy of 3T MRI, CT and endoscopic ultrasound for preoperative T staging of potentially resectable esophageal cancer. Cancer Imaging. 2020;20(1):64. doi:10.1186/s40644-020-00343-w
5. Hong SJ, Kim TJ, Nam KB, et al. New TNM staging system for esophageal cancer: what chest radiologists need to know. Radiographics. 2014;34(6):1722-1740. doi:10.1148/rg.346130079
6. Huang B, Xu MC, Pennathur A, et al. Endoscopic resection with adjuvant treatment versus esophagectomy for early-stage [...]